The practice will be based on the electoral data archives below, compiling data on elections to the Spanish Congress of Deputies from 2008 to the present, as well as surveys, municipalities codes and abbreviations. First steps:
Transform election_data and surveys databases to tidydata.
Discard polls that do not meet the base requirements.
Group the different party strands into their corresponding parent party.
Group those parties that do not fit into any of the above into others.
Use abbrev to stipulate party acronyms in election_data.
Homogenize data with respect to reality.
Distinguish between the two general elections in 2019.
Make "tidy" the "surveys" and "election_data" dataset and group the unique values of "abbrev":
After this first step, we noticed that the name of the parties changes between elections, so we ran a code to standardize those names to only one per political party.
# A tibble: 15 × 1
partido
<chr>
1 CIUDADANOS-PARTIDO DE LA CIUDADANIA
2 CIUDADANOS EN BLANCO
3 CIUDADANOS PARTIDO DE LA CIUDADANIA
4 PARTIDO CIUDADANOS UNIDOS DE ARAGON
5 CIUDADANOS DE CENTRO DEMOCRÁTICO
6 FORO DE CIUDADANOS
7 UNION DE CIUDADANOS INDEPENDIENTES DE TOLEDO
8 CIUDADANOS-PARTIDO DE LA CIUDADANÍA
9 CIUDADANOS PARTIDO DE LA CIUDADANÍA
10 CIUDADANOS-PARTIDO DE LA CIUDADANÍA (C's)
11 CIUDADANOS, PARTIDO DE LA CIUDADANÍA
12 CIUDADANOS LIBRES UNIDOS
13 CIUDADANOS RURALES AGRUPADOS
14 CIUDADANOS DE CENTRO DEMOCRÁTICO-CANDIDATURA INDEP
15 CIUDADANOS INDEPENDIENTES DE LINARES UNIDOS
#| message: false#| eval: true#| echo: true#| code-fold: trueelection_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("CIUDADANOS-PARTIDO DE LA CIUDADANIA","CIUDADANOS PARTIDO DE LA CIUDADANIA","CIUDADANOS-PARTIDO DE LA CIUDADANÍA","CIUDADANOS PARTIDO DE LA CIUDADANÍA","CIUDADANOS-PARTIDO DE LA CIUDADANÍA (C's)","CIUDADANOS, PARTIDO DE LA CIUDADANÍA" ),"CIUDADANOS-PARTIDO DE LA CIUDADANIA", partido ) )election_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("PARTIDO POPULAR","PARTIT POPULAR/PARTIDO POPULAR","UNION DEL PUEBLO NAVARRO EN COALICION CON EL PARTIDO POPULAR","PARTIDO POPULAR/PARTIT POPULAR","PARTIDO POPULAR-EXTREMADURA UNIDA","PARTIDO POPULAR EN COALICIÓN CON EL PARTIDO ARAGONÉS","UNIÓN DEL PUEBLO NAVARRO EN COALICIÓN CON EL PARTIDO POPULAR","PARTIDO POPULAR (PP)","PARTIDO POPULAR EN COALICIÓN CON EL PARTIDO ARAGON","PARTIDO POPULAR-FORO","PARTIT POPULAR-PARTIDO POPULAR","PARTIDO POPULAR - FORO","PARTIDO POPULAR / PARTIT POPULAR" ),"PARTIDO POPULAR", partido ) )election_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("CONVERGENCIA I UNIO","CONVERGÈNCIA I UNIÓ","CONVERGÈNCIA i UNIÓ" ),"CONVERGENCIA I UNIO", partido ) )election_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("PARTIDO SOCIALISTA DE EUSKADI-EUSKADIKO EZKERRA","PARTIDO SOCIALISTA OBRERO ESPAÑOL","PARTIDO DOS SOCIALISTAS DE GALICIA-PARTIDO SOCIALISTA OBRERO ESPAÑOL","PARTIDO SOCIALISTA DE EUSKADI-EUSKADIKO EZKERRA (PSOE)","PARTIDO SOCIALISTA OBRERO ESPAÑOL DE ANDALUCIA","PARTIDO DOS SOCIALISTAS DE GALICIA-PSOE","PARTIDO DOS SOCIALISTAS DE GALICIA - PSOE","PARTIDO DOS SOCIALISTAS DE GALICIA - PARTIDO SOCIALISTA OBRERO ESPAÑOL","PARTIDO SOCIALISTA DE EUSKADI-EUSKADIKO EZKERRA (P","PARTIDO SOCIALISTA OBRERO ESPAÑOL-NUEVA CANARIAS","PARTIDO SOCIALISTA DE EUSKADI-EUSKADIKO EZKERRA(PSOE)","PARTIT SOCIALISTA OBRER ESPANYOL","PARTIDO DOS SOCIALISTAS DE GALICIA-PARTIDO SOCIALI" ),"PARTIDO SOCIALISTA OBRERO ESPAÑOL", partido ) )election_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO","PARTIDO NACIONALISTA VASCO" ),"EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO", partido ) )election_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("IZQUIERDA UNIDA-ALTERNATIVA","ESQUERRA UNIDA DEL PAIS VALENCIA-IZQUIERDA REPUBLICANA: ESQUERRA UNIDA I REPUBLICANA","IZQUIERDA UNIDA LOS VERDES-CONVOCATORIA POR ANDALUCIA - ALTERNATIVA","ESQUERRA UNIDA - ELS VERDS","INICIATIVA PER CATALUNYA VERDS-ESQUERRA UNIDA I ALTERNATIVA","IZQUIERDA UNIDA LOS VERDES CONVOCATORIA POR ANDALUCIA-ALTERNATIVA","ESQUERDA UNIDA-IZQUIERDA UNIDA-ALTERNATIVA","IZQUIERDA UNIDA LOS VERDES-CONVOCATORIA POR ANDALUCIA-ALTERNATIVA","INICIATIVA PER CATALUNYA VERDS - ESQUERRA UNIDA I ALTERNATIVA","IZQUIERDA UNIDA COMUNIDAD DE MADRID-ALTERNATIVA","IZQUIERDA UNIDA DE LA REGION DE MURCIA-ALTERNATIVA","IZQUIERDA UNIDA DE NAVARRA - NAFARROAKO EZKER BATUA - ALTERNATIVA","IZQUIERDA UNIDA - BLOQUE POR ASTURIES - LOS VERDES","IZQUIERDA UNIDA CANARIA-ALTERNATIVA","ESQUERRA UNIDA DEL PAIS VALENCIA-IZQUIERDA REPUBLICANA:ESQUERRA UNIDA I REPUBLICANA","IZQUIERDA UNIDA DE CEUTA","IZQUIERDA UNIDA-LOS VERDES: LA IZQUIERDA PLURAL/EZKER ANITZA (IU-LV)","IZQUIERDA UNIDA DE CASTILLA-LA MANCHA-LOS VERDES: LA IZQUIERDA PLURAL","ESQUERRA UNIDA DEL PAIS VALENCIA-LOS VERDES: L'ESQUERRA PLURAL","IZQUIERDA UNIDA LOS VERDES-CONVOCATORIA POR ANDALUCÍA: LA IZQUIERDA PLURAL(IULV-CA)","IZQUIERDA UNIDA DE CASTILLA Y LEON: LA IZQUIERDA PLURAL (IUCL)","IZQUIERDA UNIDA VERDES-SOCIALISTAS INDEPENDIENTES DE EXTREMADURA: LA IZQUIERDA PLURAL","ESQUERRA UNIDA ILLES BALEARS: LA IZQUIERDA PLURAL","INICIATIVA PER CATALUNYA VERDS-ESQUERRA UNIDA I ALTERNATIVA: L'ESQUERRA PLURAL","IZQUIERDA UNIDA DE CASTILLA Y LEÓN: LA IZQUIERDA PLURAL","IZQUIERDA UNIDA LOS VERDES-CONVOCATORIA POR ANDALUCÍA: LA IZQUIERDA PLURAL","ESQUERRA UNIDA PAÍS VALENCIÀ-ELS VERDS: L'ESQUERRA PLURAL","ESQUERDA UNIDA-OS VERDES: A ESQUERDA PLURAL","INICIATIVA PER CATALUNYA VERDS-ESQUERRA UNIDA i ALTERNATIVA: L'ESQUERRA PLURAL","IZQUIERDA UNIDA-LOS VERDES: LA IZQUIERDA PLURAL/EZKER ANITZA","CHUNTA ARAGONESISTA-IZQUIERDA UNIDA, LA IZQUIERDA DE ARAGÓN: LA IZQUIERDA PLURAL","IZQUIERDA UNIDA-LA IZQUIERDA PLURAL","IZQUIERDA UNIDA-LOS VERDES: LA IZQUIERDA PLURAL","IZQUIERDA UNIDA-VERDES DE LA REGION DE MURCIA: LA IZQUIERDA PLURAL","IZQUIERDA UNIDA DE ASTURIAS-IZQUIERDA XUNIDA D'ASTURIES:LA IZQUIERDA PLURAL","IZQUIERDA UNIDA CANARIA-INICIATIVA POR EL HIERRO-LOS VERDES: LA IZQUIERDA PLURAL","IZQUIERDA UNIDA DE CANTABRIA: LA IZQUIERDA PLURAL","CHUNTA ARAGONESISTA-IZQUIERDA UNIDA, La Izquierda de Aragón: La Izquierda Plural","IZQUIERDA UNIDA - PARTIDO DEMOCRÁTICO Y SOCIAL DE CEUTA: LA IZQUIERDA PLURAL","UNIDAD POPULAR- ORAIN BATERA:EZKER ANITZA-IZQUIERD","UNIDAD POPULAR: IZQUIERDA UNIDA, UNIDAD POPULAR EN","UNITAT POPULAR:ESQUERRA UNIDA DEL PAÍS VALENCIÀ, U","UNIDAD POPULAR: IZQUIERDA UNIDA LOS VERDES-CONVOCA","UNITAT POPULAR BALEARS: IZQUIERDA UNIDA, UNIDAD PO","UNIDAD POPULAR: IZQUIERDA UNIDA-UNIDAD POPULAR EN","UNITAT POPULAR: ESQUERRA UNIDA DEL PAÍS VALENCIÀ,","UNIDAD POPULAR: IZQUIERDA UNIDA LOS VERDES - CONVO","UNIDAD POPULAR-ORAIN BATERA:EZKER ANITZA-IZQUIERDA","UNIDAD POPULAR EN ARAGÓN: IZQUIERDA UNIDA, CHUNTA","UNIDAD POPULAR: UNIDAD POPULAR EN COMÚN, IZQUIERDA","UNIDAD POPULAR-ORAIN BATERA:IZQUIERDA UNIDA, BATZA","UNIDAD POPULAR: IZQUIERDA UNIDA. UNIDAD POPULAR E","UNIDAD POPULAR: IZQUIERDA UNIDA CANARIA, UNIDAD PO","UNIDAD POPULAR-IZQUIERDA UNIDA","EZKER BATUA-BERDEAK-ALTERNATIVA","PODEMOS-AHAL DUGU","PODEMOS","COMPROMÍS-PODEMOS-ÉS EL MOMENT", "PODEMOS-Ahora Alto Aragón en Común","PODEMOS/AHAL DUGU", "COMPROMÍS-PODEMOS-EUPV: A LA VALENCIANA", "EN COMÚ PODEM","EN COMÚ PODEM - GUANYEM EL CANVI","UNIDOS PODEMOS/ELKARREKIN AHAL DUGU","UNIDOS PODEMOS", "UNIDOS PODEMOS POR ANDALUCÍA","UNITS PODEM MÉS", "UNIDOS PODEMOS EN ALTO ARAGÓN","UNIDOS PODEMOS-ELKARREKIN AHAL DUGU", "UNIDOS PODEMOS/XUNÍOS PODEMOS", "UNIDOS PODEMOS EN ARAGÓN", "ELKARREKIN PODEMOS-UNIDAS PODEMOS", "UNIDAS PODEMOS", "UNIDAS PODEMOS-UNIDES PODEM", "EN COMÚN-UNIDAS PODEMOS", "UNIDAS PODEMOS-ALTOARAGÓN EN COMÚN", "UNIDAS PODEMOS-XUNIES PODEMOS", "UNIDAS PODEMOS-XUNÍES PODEMOS" ),"UNIDAS PODEMOS", partido ) )election_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("BLOQUE NACIONALISTA GALEGO","BLOQUE NACIONALISTA GALLEGO"),"BLOQUE NACIONALISTA GALEGO", partido ) )election_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("ESQUERRA REPUBLICANA DE CATALUNYA", "ESQUERRA REPUBLICANA", "ESQUERRA REPUBLICANA DE CATALUNYA-CATALUNYA SÍ", "ESQUERRA REPUBLICANA/CATALUNYA SÍ", "ESQUERRA REPUBLICANA DE CATALUNYA-SOBIRANISTES"),"ESQUERRA REPUBLICANA DE CATALUNYA", partido ) )election_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("EH-BILDU", "EUSKAL HERRIA BILDU", "ARALAR", "PARTIDO POLITICO ARALAR", "AMAIUR", "NAFARROA BAI", "GEROA BAI", "ONGI ETORRI", "SORTU", "EUSKO ALKARTASUNA", "ALTERNATIBA", "EUSKAL HERRIA"),"EUSKAL HERRIA BILDU", partido ) )election_data_tidy <- election_data_tidy |>mutate(partido =if_else( partido %in%c("MÁS PAÍS", "MÁS PAÍS-ANDALUCÍA", "MÁS PAÍS-EQUO", "MÁS PAÍS-CANDIDATURA ECOLOGISTA","MÁS PAÍS-CHUNTA ARAGONESISTA-EQUO"),"MÁS PAÍS", partido ) )election_data_tidy <- election_data_tidy |>select(-siglas)partidos_excluidos <-c("CIUDADANOS-PARTIDO DE LA CIUDADANIA", "PARTIDO POPULAR", "CONVERGENCIA I UNIO", "PARTIDO SOCIALISTA OBRERO ESPAÑOL","EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO","BLOQUE NACIONALISTA GALEGO", "UNIDAS PODEMOS","ESQUERRA REPUBLICANA DE CATALUNYA", "EUSKAL HERRIA BILDU", "MÁS PAÍS", "VOX")election_data_tidy <- election_data_tidy |>mutate(partido =if_else(!partido %in% partidos_excluidos, "OTHERS", partido), ) |>group_by(anno, mes, tipo_eleccion, codigo_ccaa, codigo_provincia, codigo_municipio, codigo_distrito_electoral, numero_mesas, censo, participacion_1, participacion_2, votos_blancos, votos_nulos, votos_candidaturas, partido) |>summarise(votos =sum(votos, na.rm =TRUE), .groups ="drop")election_data_tidy <- election_data_tidy |>left_join(abbrev_unique |>select(denominacion, siglas), by =c("partido"="denominacion"))sapply(election_data_tidy, class)
anno mes tipo_eleccion
"numeric" "character" "character"
codigo_ccaa codigo_provincia codigo_municipio
"character" "character" "character"
codigo_distrito_electoral numero_mesas censo
"numeric" "numeric" "numeric"
participacion_1 participacion_2 votos_blancos
"numeric" "numeric" "numeric"
votos_nulos votos_candidaturas partido
"numeric" "numeric" "character"
votos siglas
"numeric" "character"
To verify the work we have just done, we check that the NAs in "siglas" are only found when "partido" = others:
mismatched_rows <- election_data_tidy |>filter(is.na(siglas) & partido !="OTHERS")if (nrow(mismatched_rows) >0) {print("Rows where 'siglas' is NA but 'partido' is not 'OTHER':")print(mismatched_rows)} else {print("No mismatched rows found. All NAs in 'siglas' correspond to 'OTHER' in 'partido'.")}
[1] "No mismatched rows found. All NAs in 'siglas' correspond to 'OTHER' in 'partido'."
To be able to analyze the geographical dimension of the electoral data, the table cod_mun was made available. However, we realized that the codes for some CCAA in this table did not exactly match the ones in any table found online. We used {datapasta} to paste a table from the INE which was close enough (https://www.ine.es/daco/daco42/codmun/cod_ccaa_provincia.htm).
Afterwards, we customized the column names and manually added the code info on Ceuta and Melilla, which could not be pasted through the function because there was another row with only one column above them, called “Ciudades autónomas”, that was stopping it.
We then used this code to check the class of the variables. We observe that the key columns are the same class across all tables, which will facilitate the join.
Code
sapply(cod_ine, class)
codigo_ccaa ccaa codigo_provincia provincia
"character" "character" "character" "character"
Code
sapply(election_data_tidy, class)
anno mes tipo_eleccion
"numeric" "character" "character"
codigo_ccaa codigo_provincia codigo_municipio
"character" "character" "character"
codigo_distrito_electoral numero_mesas censo
"numeric" "numeric" "numeric"
participacion_1 participacion_2 votos_blancos
"numeric" "numeric" "numeric"
votos_nulos votos_candidaturas partido
"numeric" "numeric" "character"
votos siglas
"numeric" "character"
Code
sapply(cod_mun, class)
cod_mun municipio
"character" "character"
We used a code like this with some easy municipalities we knew the name and the CCAA of, in order to check the correspondence between the two tables.
Code
cod_mun |>filter(municipio =="Lugo") |># for examplepull(cod_mun)
[1] "11-27-028"
We concluded that:
Andalucía, Aragón, Asturias, Baleares, Cantabria, Cataluña, Ceuta and Melilla share the same code between the two tables.
The two Castillas are interchanged: Castilla y León is 7 in the INE table and 8 in our elections database. Castilla-La Mancha is 8 in the INE and 7 in our database.
Comunidad Valenciana is 10 in the INE and 17 in our database.
Extremadura is 11 in the INE and 10 in our database.
Galicia is 12 in the INE and 11 in our database.
Madrid is 13 in the INE and 12 in our database.
Murcia is 14 in INE, 15 in our database.
Navarra is 15 in INE, 13 in our database.
País Vasco is 16 in INE, 14 in our database.
La Rioja is 17 in INE, 16 in our database.
Code
cod_ine <- cod_ine |>mutate(codigo_ccaa =case_when( codigo_ccaa =="07"~"08", # Castilla y León codigo_ccaa =="08"~"07", # Castilla-La Mancha codigo_ccaa =="10"~"17", # Comunidad Valenciana codigo_ccaa =="11"~"10", # Extremadura codigo_ccaa =="12"~"11", # Galicia codigo_ccaa =="13"~"12", # Madrid codigo_ccaa =="14"~"15", # Murcia codigo_ccaa =="15"~"13", # Navarra codigo_ccaa =="16"~"14", # País Vasco codigo_ccaa =="17"~"16", # La RiojaTRUE~ codigo_ccaa # Keep the value of codigo_ccaa for the rest of the CCAA )) |>mutate(ccaa =sub(",.*", "", ccaa))
We used this to make the codes between the cod_ine and cod_mun table match, and we also simplified the names of the CCAA (the code used subtracts anything after the comma for long names).
Code
election_data_tidy <- election_data_tidy |>left_join(cod_ine, by =c("codigo_ccaa"="codigo_ccaa", "codigo_provincia"="codigo_provincia"))
We split the variable cod_mun into the three different codes it contains, to be able to do the left join (because some municipalities from different CCAA and province share the same codigo_municipio).
Code
cod_mun <- cod_mun |>separate(cod_mun, into =c("codigo_ccaa", "codigo_provincia", "codigo_municipio"), sep ="-", remove =FALSE)election_data_tidy <- election_data_tidy |>left_join(cod_mun, by =c("codigo_ccaa", "codigo_provincia", "codigo_municipio"))
We then used this code to check if there were any NAs resulting from the joins.
Code
colSums(is.na(election_data_tidy))
anno mes tipo_eleccion
0 0 0
codigo_ccaa codigo_provincia codigo_municipio
0 0 0
codigo_distrito_electoral numero_mesas censo
0 0 0
participacion_1 participacion_2 votos_blancos
0 0 0
votos_nulos votos_candidaturas partido
0 0 0
votos siglas ccaa
0 48737 0
provincia cod_mun municipio
0 0 0
It seems like the tables joined nicely, so now our dataset. Lastly, we ran this code to check the elections date:
# A tibble: 6 × 2
# Groups: anno [5]
anno mes
<dbl> <chr>
1 2008 03
2 2011 11
3 2015 12
4 2016 06
5 2019 04
6 2019 11
With this code, we realize that, in 2019, there were two elections. Thus, to avoid later addition problems, we create the variable annomes, in order to account for this problem.
We filter election data for municipalities with census>100,000, calculate total votes for each party, and identify the party with the highest votes per municipality and election period.
# A tibble: 26 × 3
# Groups: partido [7]
partido annomes overall_wins
<chr> <chr> <int>
1 PARTIDO POPULAR 201111 39
2 PARTIDO SOCIALISTA OBRERO ESPAÑOL 201904 38
3 PARTIDO POPULAR 201606 36
4 PARTIDO SOCIALISTA OBRERO ESPAÑOL 201911 32
5 PARTIDO POPULAR 201512 27
6 PARTIDO POPULAR 200803 23
7 PARTIDO SOCIALISTA OBRERO ESPAÑOL 200803 19
8 UNIDAS PODEMOS 201512 11
9 OTHERS 201512 8
10 OTHERS 201606 7
# ℹ 16 more rows
With that, it is easier to outline the performance of political parties. The PP achieved its highest recorded win (39) in big cities in November 2011, followed by 36 in June 2016 and 27 in December 2015.
Similarly, the PSOE reached its highest (38 wins) in April 2019 and secured 32 wins in November 2019 in big municipalities.
Minor parties, like Unidas Podemos and other parties, also reached high results during December 2015, with 11 and 8 wins, respectively. It could be concluded that the results show the political shift in Spain after the 15M movement.
Which party was the second when the first was the PSOE? And when the first was the PP?
Then, for the PSOE, the data are grouped by date (year-month), municipality and party, adding up the votes and sorting by totals. The result is filtered to obtain the municipality and date where the PSOE is the first party, and then the second most voted party in those same municipalities is identified. Then, the process is repeated for the PP, performing the same analysis but focusing on the PP as the first party. Finally, the code shows how many times each party was the second most voted party in these contexts.
# A tibble: 9 × 2
partido times_second
<chr> <int>
1 PARTIDO SOCIALISTA OBRERO ESPAÑOL 20352
2 UNIDAS PODEMOS 1804
3 CIUDADANOS-PARTIDO DE LA CIUDADANIA 1469
4 VOX 835
5 OTHERS 717
6 EUSKAL HERRIA BILDU 119
7 BLOQUE NACIONALISTA GALEGO 57
8 EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO 39
9 CONVERGENCIA I UNIO 4
Code
print(second_party_when_psoe_first)
# A tibble: 8 × 2
partido times_second
<chr> <int>
1 PARTIDO POPULAR 10934
2 UNIDAS PODEMOS 1040
3 VOX 680
4 CIUDADANOS-PARTIDO DE LA CIUDADANIA 649
5 OTHERS 319
6 EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO 73
7 EUSKAL HERRIA BILDU 21
8 BLOQUE NACIONALISTA GALEGO 1
The first table shows the parties that were the second most voted in the municipalities where PP was the first. PSOE ranked second the most, with 20,352 occasions, followed by Unidas Podemos with 1,804. Regional parties such as Euskal Herria Bildu (119), Bloque Nacionalista Galego (57) and Euzko Alderdi Jeltzalea-PNV (39) have a significantly lower number of ending second.
When PSOE is the most voted party, PP is the second most voted party on 10,934 occasions. United Podemos follows with 1,040, showing a significant presence. VOX and Ciudadanos also appear as the second most voted in several municipalities, with 680 and 649 occasions respectively.
This suggests that bipartidism in Spain is still a key component of the electoral procces, and that there are some regional dynamics that influence electoral support.
Code
library(ggalluvial)second_party_all <-bind_rows( second_party_when_psoe_first %>%mutate(first_party ="PSOE"), second_party_when_pp_first %>%mutate(first_party ="PP")) second_party_all <- second_party_all %>%left_join(abbrev_unique, by =c("partido"="denominacion")) %>%mutate(siglas =ifelse(is.na(siglas), "OTHERS", siglas))colores_partidos <-c("PP"="#004a95","PSOE"="#e30613","OTHERS"="#100F0F","CIU"="#727dff","PODEMOS-EUP"="#782c67","ERC"="#ffbf41","EAJ-PNV"="#008035","BNG"="#d41c54","EH-BILDU"="#04ccb4","C's"="#fc5204","M PAÍS"="#0a7262","VOX"="#5ac035")ggplot(second_party_all, aes(axis1 = first_party, axis2 = siglas, y = times_second)) +geom_alluvium(aes(fill = siglas), width =1/12, alpha =0.7) +geom_stratum(width =1/12, fill ="lightgray", color ="black") +geom_text(stat ="stratum", aes(label =after_stat(stratum)), size =2, color ="black", fill) +labs(title ="Relations between First and Second Place Parties (PSOE and PP)",fill ="Party" ) +scale_fill_manual(values = colores_partidos) +theme_minimal() +theme(axis.title.x =element_blank(), axis.title.y =element_blank(), axis.text.x =element_blank(), axis.text.y =element_blank(), plot.title =element_text(size =14, face ="bold", hjust =0.5), plot.subtitle =element_text(size =10, hjust =0.5, color ="gray"), legend.title =element_text(size =10), legend.text =element_text(size =9), strip.text =element_text(size =8), panel.grid =element_blank(), panel.border =element_blank() )
Who benefits from low turnout?
The first step is to calculate the turnout in each municipality for each election.
According to the Spanish Ministry of Internal Affais, the average participation in the Spanish general elections has been around 70%. Therefore, we will be considering low turnout as less than 55% of votes regarding the census.
# A tibble: 9 × 2
partido times_win_low_turnout
<chr> <int>
1 PARTIDO POPULAR 237
2 PARTIDO SOCIALISTA OBRERO ESPAÑOL 151
3 EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO 90
4 EUSKAL HERRIA BILDU 87
5 OTHERS 54
6 CONVERGENCIA I UNIO 22
7 UNIDAS PODEMOS 16
8 VOX 10
9 ESQUERRA REPUBLICANA DE CATALUNYA 5
The results show that Partido Popular and PSOE (bipartidism parties) are the most benefited from low turnout with 237 and 151 wins respectively. They are followed by regional parties from the País Vasco, with 90 wins for PNV and 87 wins for EH-Bildu.
Overall, the outcome indicates that larger, well-established parties have achieved more victories compared to smaller/newer parties.
How to analyze the relationship between census and vote? Is it true that certain parties win in rural areas?
The first step to examine this relationship is to look at the correlation between variables. It is important to note that this does not imply causality, but it does give us an indication of how the variables are related to each other.
Code
library(dplyr)#New variable to calculate the proportion of voteselection_data_tidy <- election_data_tidy |>mutate(proporcion_votos = votos / censo) #Correlation Coefficient with absolute valuescor(election_data_tidy$censo, election_data_tidy$votos, use ="complete.obs")
[1] 0.5288495
Code
#Correlation Coefficient with the proportioncor(election_data_tidy$censo, election_data_tidy$proporcion_votos, use ="complete.obs")
[1] -0.002263282
The correlation of 0.55 between the census and total votes indicates a moderate positive relationship. This is expected, as a higher number of people registered in a region tends to be associated with a higher number of votes. However, this relationship is not perfectly linear, suggesting that other factors also influence the total number of votes.
On the other hand, the correlation of -0.0026 between the census and the proportion of votes indicates a virtually non-existent relationship. This suggests that, once the number of votes is normalized by the size of the census, the population size no longer becomes a determining factor.
To visually represent this relationship, we created a graph.
Code
ggplot(election_data_tidy, aes(x = censo, y = proporcion_votos, color =as.factor(annomes))) +geom_line(alpha =0.6) +# Líneas para mostrar la relación generalgeom_point(alpha =0.6, size =1) +# Puntos para destacar cada observaciónlabs(title ="Relationship Between Census Size and Vote Proportion",x ="Census Size",y ="Vote Proportion",color ="Year" ) +scale_color_brewer(palette ="Set1") +theme_minimal()
Continuing to address the question regarding rural areas, we will use the National Geographic Institute as a reference, which defines population centers with more than 10,000 inhabitants as cities. Therefore, we begin by creating a new variable to classify municipalities as either rural or urban.
# A tibble: 24 × 4
# Groups: categoria_region, partido [24]
categoria_region partido siglas votos_totales
<chr> <chr> <chr> <dbl>
1 Rural PARTIDO POPULAR PP 12344495
2 Rural PARTIDO SOCIALISTA OBRERO ESPAÑOL PSOE 11248698
3 Rural OTHERS OTHERS 4181967
4 Rural UNIDAS PODEMOS PODEM… 4058743
5 Rural CIUDADANOS-PARTIDO DE LA CIUDADANIA C's 2319876
6 Rural VOX VOX 1524008
7 Rural ESQUERRA REPUBLICANA DE CATALUNYA ERC 1156711
8 Rural EUSKAL HERRIA BILDU EH Bi… 604126
9 Rural EUZKO ALDERDI JELTZALEA-PARTIDO NACION… EAJ-P… 577727
10 Rural CONVERGENCIA I UNIO CIU 556797
# ℹ 14 more rows
When examining the total votes by party in rural areas, we observe that the Partido Popular (PP) has been the most voted, surpassing the Partido Socialista Obrero Español (PSOE) by over a million votes, a significant difference. However, the most noteworthy aspect is the dominance of these two parties compared to others, which were less significant until 2015. On the other hand, it is worth noting that if we also consider the results from urban areas, we observe a similar trend.
For a deeper analysis, it is necessary to examine the differences between these parties across various elections.
Code
#Results per election and party in the rural arearesultados_rural_por_eleccion <- election_data_tidy %>%group_by(annomes, categoria_region, partido, siglas) %>%summarise(votos_totales =sum(votos), .groups ="drop") %>%filter(categoria_region =="Rural") %>%arrange(annomes, -votos_totales)print(resultados_rural_por_eleccion)
# A tibble: 72 × 5
annomes categoria_region partido siglas votos_totales
<chr> <chr> <chr> <chr> <dbl>
1 200803 Rural PARTIDO POPULAR PP 2755171
2 200803 Rural PARTIDO SOCIALISTA OBRERO ESPA… PSOE 2747352
3 200803 Rural OTHERS OTHERS 588878
4 200803 Rural CONVERGENCIA I UNIO CIU 242516
5 200803 Rural UNIDAS PODEMOS PODEM… 238964
6 200803 Rural ESQUERRA REPUBLICANA DE CATALU… ERC 102239
7 200803 Rural EUZKO ALDERDI JELTZALEA-PARTID… EAJ-P… 93137
8 200803 Rural BLOQUE NACIONALISTA GALEGO BNG 79747
9 200803 Rural EUSKAL HERRIA BILDU EH Bi… 59198
10 200803 Rural CIUDADANOS-PARTIDO DE LA CIUDA… C's 4418
# ℹ 62 more rows
The results confirm that traditional parties like the Partido Popular (PP) and the Partido Socialista Obrero Español (PSOE) consistently dominate the rural vote in Spain. The PP led in most elections, particularly in 2008, 2011, 2015, and 2016, showcasing strong support in rural areas. However, in the 2019 elections, significant changes emerged.
In April 2019, PSOE overtook PP as the leading party in rural areas, reflecting a shift in voter preference. Additionally, VOX gained considerable ground in 2019, especially in the November election, becoming the third most voted party and surpassing more established regional and national parties. These results highlight the evolving political dynamics in rural areas, where traditional dominance is occasionally challenged by emerging parties.
We will represent this graphically to visually observe these differences and attempt to identify other trends.
Code
# Colors for each partycolores_partidos <-c("PARTIDO POPULAR"="#004a95","PARTIDO SOCIALISTA OBRERO ESPAÑOL"="#e30613","OTHERS"="#100F0F","CONVERGENCIA I UNIO"="#727dff","UNIDAS PODEMOS"="#782c67","ESQUERRA REPUBLICANA DE CATALUNYA"="#ffbf41","EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO"="#008035","BLOQUE NACIONALISTA GALEGO"="#d41c54","EH-BILDU"="#04ccb4","CIUDADANOS-PARTIDO DE LA CIUDADANIA"="#fc5204", # Fixed name"MÁS PAÍS"="#0a7262","VOX"="#5ac035")library(ggplot2)ggplot(resultados_rural_por_eleccion, aes(x =factor(annomes), y = votos_totales, fill = partido)) +geom_bar(stat ="identity", position ="dodge") +scale_fill_manual(values = colores_partidos) +# Use the named vector herelabs(title ="Votos por partido en áreas rurales (2008-2019)", x ="Elecciones", y ="Votos Totales") +theme_minimal() +theme(axis.text.x =element_text(angle =45, hjust =1))
The plot highlights the consistent dominance of PP and PSOE in rural areas. From 2008 to 2016, the PP emerges as the clear leader, particularly in 2011, where it reached its highest vote count. The PSOE maintained a competitive position, often coming in second. However, 2019 marks a turning point, as the PSOE surpasses the PP in the April election, showcasing a shift in voter preferences.
The trend illustrates the gradual decline of bipartisanship. Until 2015, the concentration of votes was primarily shared between PP and PSOE, reflecting a strong two-party system. From 2015 onward, the vote distribution becomes more diverse, with emerging parties like Unidas Podemos, Ciduadanos and later, VOX gaining significant traction, especially in the 2019 November election. The presence of regional parties, such as Esquerra Republicana de Catalunya and others, remains relatively stable but secondary compared to the national parties.
This evolving dynamic underlines the changing political landscape in rural Spain, with new actors challenging the traditional dominance of PP and PSOE. It is also necessary to mention that the “Others” category, which aggregates votes from various smaller parties (PACMA, La Falange…), starts to show significant relevance from 2016 onward. However, it is important to note that this category represents the sum of many smaller parties, which can fluctuate over time and dilute the overall impact of any single party in this group.
How to calibrate the error of the polls (remember that the polls are voting intentions at national level)?
First we need to calculate the percentage of each party in the elections:
Code
#Calculate the total valid votes per yearelection_data_tidy <- election_data_tidy %>%mutate(votos_validos = votos_candidaturas + votos_blancos) #party votes + blank votes#Now we can calculate the percentage of votes per partyelection_data_nacional <- election_data_tidy %>%group_by(anno, annomes, partido, siglas) %>%summarise(votos_totales =sum(votos), votos_validos_totales =sum(votos_validos)) %>%mutate(porcentaje_votos_nacional = (votos_totales / votos_validos_totales) *100)# Preparing the data for joininglibrary(lubridate) surveys_tidy <- surveys_tidy %>%rename(siglas = partido) %>%mutate(date_elec =year(date_elec)) %>%filter(date_elec >=2008) comparacion <-left_join(election_data_nacional, surveys_tidy, by =c("siglas", "anno"="date_elec"))
The first step to calibrate the error is to calculate the absolute error between the percentage obtained by the parties and the voting intention in the surveys.
Code
#Calculation of errorcomparacion <- comparacion %>%mutate(error_absoluto =abs(porcentaje_votos_nacional - voto_intencion))#The Root Mean Square Error (RMSE), a common measure of this difference, can also be calculated by squaring the differences, averaging them, and taking the square root.rmse <-sqrt(mean((comparacion$porcentaje_votos_nacional - comparacion$voto_intencion)^2, na.rm =TRUE))print(rmse)
[1] 3.349456
The RMSE is 3.35, this means that on average, the difference between the predicted voting intentions from the polls and the election results is 3.35 percentage points.
Which polling houses got it right the most and which ones deviated the most from the results?
We can see differences in accuracy among the polling houses. Those that have made the least errors are IBES with an RMSE of 0.51, METRA SEIS with 0.76, APPEND with 1.89, and TOP POSITION with 2.22. The polling houses with the highest errors are OBRADOIRO DE SOCIOLOXÍA with an RMSE of 4.83, DEMOMÉTRICA with 5.16, and ASEP with 5.36.
We will plot those with an RMSE greater than 3.35 in red, and the others in blue. As we saw earlier, the overall RMSE for all polling houses was 3.35, so I will highlight the polling houses with an RMSE above this value in red, indicating that their predictions were less accurate.
In summary, the polling house with the highest error is ASEP in 2011, with an error of 11.72, marking a significant deviation from the actual results. Other notable polling houses include INVYMARK in 2019, with an error of 5.15, and DEMOMÉTRICA in 2008, with an error of 4.50.
Some polling houses appear consistently with relatively high errors. For example, INVYMARK stands out for its repeated presence in 2008, 2011, and 2019, with high error rates in each of those years. Similarly, METROSCOPIA appears multiple times, including in 2015 with an error of 4.33. Polling houses such as SIMPLE LÓGICA and DYM also show recurring presence in years like 2011 and 2016, indicating a pattern of notable deviations across different elections.
Original questions
Create a function that returns, as an outcome, the ranking of parties by number of votes in each CCAA and election. Then, put it to test with some examples.
The inputs of this custom function are the autonomous region of interest, and the grouped variable “annomes” which identifies the year and month of each election. This function is only applied to our table election_data_tidy, which has all of the information we need in the format we have given it. We filtered out the parties that had 0 votes, because we assume that they did not run for election that year and/or in that CCAA. The output is a table with the names of the parties and the total votes in said region, in descending order.
We can now test our function for the following questions:
What was the most voted party in the 2008 elections in Andalucía?
# A tibble: 5 × 2
partido votos_totales
<chr> <dbl>
1 PARTIDO SOCIALISTA OBRERO ESPAÑOL 2312900
2 PARTIDO POPULAR 1713783
3 UNIDAS PODEMOS 229159
4 OTHERS 163602
5 CIUDADANOS-PARTIDO DE LA CIUDADANIA 3759
It was PSOE, with 2,312,900 votes.
And the second most voted in 2011 in País Vasco?
# A tibble: 6 × 2
partido votos_totales
<chr> <dbl>
1 EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO 323591
2 EUSKAL HERRIA BILDU 284628
3 PARTIDO SOCIALISTA OBRERO ESPAÑOL 254064
4 PARTIDO POPULAR 210063
5 OTHERS 74539
6 UNIDAS PODEMOS 19404
EH-Bildu, with about 40,000 less votes than the first party, PNV.
And the least voted in Comunitat Valenciana, in 2015?
# A tibble: 6 × 2
partido votos_totales
<chr> <dbl>
1 PARTIDO POPULAR 837068
2 UNIDAS PODEMOS 782694
3 PARTIDO SOCIALISTA OBRERO ESPAÑOL 530504
4 CIUDADANOS-PARTIDO DE LA CIUDADANIA 423558
5 OTHERS 76461
6 VOX 7240
It was Vox, with 7,240 votes.
Did the order of the parties in the ranking change much in Extremadura from the first to the second elections in 2019?
# A tibble: 6 × 2
partido votos_totales
<chr> <dbl>
1 PARTIDO SOCIALISTA OBRERO ESPAÑOL 249555
2 PARTIDO POPULAR 140249
3 CIUDADANOS-PARTIDO DE LA CIUDADANIA 117720
4 VOX 70611
5 UNIDAS PODEMOS 62222
6 OTHERS 9837
# A tibble: 6 × 2
partido votos_totales
<chr> <dbl>
1 PARTIDO SOCIALISTA OBRERO ESPAÑOL 226679
2 PARTIDO POPULAR 153945
3 VOX 99546
4 UNIDAS PODEMOS 53677
5 CIUDADANOS-PARTIDO DE LA CIUDADANIA 44837
6 OTHERS 7142
In this case, the first and second most voted parties (PSOE and PP) remained in place, but Ciudadanos fell from 3rd to 5th place, leading Vox and UP to move up one position in the ranking, from 4th to 3rd and 5th to 4th respectively.
Similarly, create a function to explore the evolution of a party in a given region throughout all of the elections in the dataset, and use it to analyse the evolution of the party Vox in the Autonomous Community of Madrid.
The inputs of the function are the CCAA and the abbreviation of the party’s name (siglas), since it is more straightforward and reproduceable than the complete name. The output is a table that displays the year and month of the election, the total votes for that party, the amount of valid votes, and the percentage of votes for that party, in the CCAA selected.
The table generated with our custom function allows us to observe the electoral evolution of Vox. The party received no votes in the 2008 and 2011 elections, which indicates that it was founded between 2011 and 2015: it did not exist for the first two year of our dataset.
In 2015, its first year running for election, Vox obtained 22,456 votes, which meant a 0.63% of the total valid votes. The party’s results declined a little in the 2016 elections, receiving about 6,000 less votes, which translates into a 0.48% of the total votes. In Spain’s electoral system there’s an electoral threshold of 3%: this means that parties must gather at least 3% of the total votes in order to gain representation in Congress.
In that sense, Vox had a considerable blowout in April of 2019, when it entered the Congress with almost 14% of the people’s votes. The elections had to be repeated seven months later due to the lack of majorities needed to form a government. The November elections only served to consolidate the position of the far-right formation in the Spanish party system.
What are the 10 municipalities with the higher average turnout across all elections? And with the lower?
Given that we have the voting information broken down to the level of municipalities, we might also ask questions related to these entities.
# A tibble: 10 × 5
municipio provincia ccaa poblacion_media participacion_media
<chr> <chr> <chr> <dbl> <dbl>
1 Illán de Vacas Toledo Castil… 4 100
2 Villarroya Rioja, La Rioja 7.67 100
3 Cellorigo Rioja, La Rioja 12.5 97.4
4 Ledesma de la Cogolla Rioja, La Rioja 16.8 97.3
5 Salcedillo Teruel Aragón 9.17 96.7
6 Estepa de San Juan Soria Castil… 8.33 96.3
7 Tortuero Guadalajara Castil… 22.8 96.2
8 Cidamón Rioja, La Rioja 24.2 95.8
9 Barrio de Muñó Burgos Castil… 27.3 95.6
10 Almohaja Teruel Aragón 21.2 95.2
The analysis reveals that the 10 municipalities with the highest voter turnout across elections are concentrated in four Autonomous Regions: Castilla-La Mancha, La Rioja, Castilla y León, and Aragón, with two cases achieving 100% participation. These municipalities are extremely small, each with fewer than 30 inhabitants, which likely contributes to their high turnout rates.
The municipalities with the lowest turnout show greater regional diversity but are primarily in northern Spain, including Navarra (four cases) and Canarias (two cases). These areas, while rural, have larger and more heterogeneous populations.
Overall, the data highlights that voter turnout is not strongly correlated with population size but is instead shaped by complex interrelations of geographical, economic, cultural, social, and political factors.
Which is the autonomous community with the highest proportion of invalid votes with respect to the total number of votes across all the elections and how does this compare with the turnout?
To answer the question, by grouping the data by year, month, type of election, and geographic codes, we calculate the total number of vote within each group.Additionally, we compute the turnout by dividing the total votes by the census size.
By regrouping the data by autonomous community and year, we calculated the total votes and null votes. Additionally, we determined the average participation rate as the ratio of total votes to the census size for each community and year, providing an overall measure of voter engagement. The results are then sorted in descending order of this proportion, allowing for an easy identification of the communities with the highest rates of invalid votes.
Finally, we identified the community with the highest proportion of invalid votes and the turnout.
The autonomous community with the highest rate of null votes in all elections is Canarias, with a rate of 0.21534431438351 and turnout of 0.0503965798794663, recorded in the elections of 2011.
How voting intention affects turnout by media type?
Media types shape how voting intention affects turnout. Let’s first classify the media into different groups based on the media outlet. This categorization helps us group the data by media type for further analysis.
Once we have the different media groups, we will we will prepare and analyzing survey data to understand how the voting intention affect turnout rate depending on the media type the survey was conducted. For this, we process survey data by extracting the year and month from field_date_from and combining them into clave_fecha_encuesta. Filtering surveys from 2008 to 2019, we group data by media type, size, year, month, and voting intention, calculating the average turnout for each group.
If we interpret the results, we can conclude that:
Press: points show a broad range of voting intention and some dispersion in average participation.The trend line is slightly downward-sloping, suggesting a small decrease in average participation as voting intention increases.
Digital press: there is wide dispersion in terms of voting intention, but most points are concentrated at lower levels of voting intention.The trend line is almost flat, indicating no clear relationship between voting intention and average participation.
Agencies: points are grouped along a horizontal line around 75% on the Y-axis indicating that average participation does not vary significantly regardless of the reported voting intention.
Television: observations are very clustered and the line is completely horizontal showing a total absence of a relationship between the two variables.
Others: with a significant dispersion, there is a slight positive trend in the regression line: as voting intention increases, average participation also appears to increase.